Abstract

We  give  an  overview  of  the  emerging  field  of  structural  genomics,  describing 
how  genomes  can  be  compared  in  terms  of  protein  structure.  As  the  number  of  genes  in  a 
genome  and  the  total  number  of  protein  folds  are  both  quite  limited,  these  comparisons 
take the form of surveys of a finite parts list, similar in some respects to demographic censuses.
Fold  surveys  have  many  similarities  with  other  whole-genome  characterizations,  e.g. 
analyses  of  motifs  or  pathways.  However,  structure  has  a  number  of  aspects  that  make  it 
particularly  suitable  for  comparing  genomes,  namely  the  way  it  allows  for  the  precise 
definition  of  a  basic  protein  module  and  the  fact  that  it  has  a  better  defined  relationship  to 
sequence  similarity  than  does  protein  function.  An  essential  requirement  for  a  structure 
survey  is  a  library  of  folds,  which  groups  the  known  structures  into  “fold  families.”  This 
library  can  be  built  up  automatically  using  a  structure-comparison  program,  and  we 
describe how important objective statistical measures are for assessing similarities
within  the  library  and  between  the  library  and  genome  sequences.  After  building  the 
library,  one  can  use  it  to  count  the  number  of  folds  in  genomes,  expressing  the  results  in 
the form of Venn diagrams and "top-10" statistics for shared and common folds.
Depending  on  the  counting  methodology  employed,  these  statistics  can  reflect  different 
aspects  of  the  genome,  such  as  the  amount  of  internal  duplication  or  gene  expression. 
Previous  analyses  have  shown  that  the  common  folds  shared  between  very  different 
microorganisms (i.e. in different kingdoms) have a remarkably similar structure, being
composed of repeated strand-helix-strand super-secondary structure units. A major
difficulty  with  this  sort  of  “fold-counting”  is  that  only  a  small  subset  of  the  structures  in  a 
complete  genome  are  currently  known  and  this  subset  is  prone  to  sampling  bias.  One  way 
of  overcoming  biases  is  through  structure  prediction,  which  can  be  applied  uniformly  and 
comprehensively  to  a  whole  genome.  Various  investigators  have,  in  fact,  already  applied 
many  of  the  existing  techniques  for  predicting  secondary  structure  and  transmembrane 
(TM)  helices  to  the  recently  sequenced  genomes.  The  results  have  been  consistent: 
Microbial  genomes  have  similar  fractions  of  strands  and  helices  even  though  they  have 
significantly  different  amino-acid  composition.  The  fraction  of  membrane  proteins  with  a 
given  number  of  TM-helices  falls  off  rapidly  with  more  TM  elements,  approximately 
according  to  a  Zipf  Law.  This  latter  finding  indicates  that  there  is  no  preference  for  the 
highly studied 7-TM proteins in microbial genomes. Continuously updated tables and
further information pertinent to this review are available over the web at
http://bioinfo.mbb.yale.edu/genome.

Introduction

The  Sequencing  of  Complete  Genomes  Highlights  the  Finiteness  of  Molecular 
Biology 

In  the  last  three  years  a  number  of  microbial  genomes  have  been  completely 
sequenced,  generating  tremendous  interest,  popular  as  well  as  scientific  [1-3].  In 
particular, in 1995 the first genome of a free-living organism, the bacterium H. influenzae,
was  sequenced  by  Venter  and  colleagues,  and  two  years  later  another  landmark  was 
reached  with  the  publication  of  the  yeast  genome,  a  significantly  more  complex  genome 
of  a  eukaryote  [4,  5]. 

One  of  the  most  important  points  highlighted  by  having  a  complete  genome  sequence  is 
the  essential  finiteness  of  molecular  biology.  That  is,  the  complete  sequence,  while 
complex,  describes  all  the  parts  necessary  for  microbial  life. 

A  Structural  Census,  the  Connection  between  Genomes  and  Structures 

Simultaneous  with  all  the  progress  being  made  in  genomics,  there  is  a  tremendous 
investment  being  made  in  structural  biology.  This  is  yielding  great  returns  in  the  form  of 
an  exponentially  increasing  number  of  protein  structures.  All  these  structures  fall  into  a 
very  limited  number  of  folding  patterns,  currently  about  350  [6-10].  It  is  believed, 
furthermore,  that  we  will  eventually  find  that  all  naturally  occurring  protein  structures  are 
composed of a very small number of folds, estimated to be ~1000 [11].

The  objective  of  this  work  is  to  discuss  various  means  of  understanding  this  finite 
universe  of  genes  in  terms  of  an  even  more  limited  repertoire  of  protein  folds.  This  is  the 
subject  of  the  new  field  of  structural  genomics  [12,  13].  One  can  achieve  some  form  of 
understanding  by  performing  large-scale  surveys,  looking  at  the  occurrence  of  protein 
structures  and  various  protein  structural  features  in  the  genomes  of  different  organisms. 
We  use  the  term  “structural  censuses”  to  describe  these  surveys,  emphasizing  the  intent  to 
provide  a  comprehensive  accounting. 

To  do  such  a  structural  census  properly,  one  needs  to  cluster  together  3D  structures 
into  a  library  of  folds  and  then  to  match  up  genome  sequences  to  structures  in  this  library. 
One  also  needs  a  way  to  characterize  the  sequences  without  structural  homologues  in 
rough  structural  terms.  This  is  usually  done  via  various  prediction  techniques,  such  as 
those  for  secondary  structure  or  transmembrane  helices.  Then  one  does  “fold  counting,” 
enumerating  how  often  a  fold  or  structural  feature  occurs  in  a  given  genome  or  organism. 
These  specific  aspects  of  a  structure  census  will  be  discussed  at  length.  But  before  doing 
so  it  is  worthwhile  to  provide  some  perspective  on  the  general  questions  addressed  and 
how  this  work  relates  to  other  types  of  genomic  analysis. 

The  Overall  Question:  At  What  Structural  Resolution  Do  Organisms  Differ? 

One  interesting  question  addressed  by  a  census  of  structures  is  to  what  degree 
certain  folds  occur  only  in  certain  branches  of  the  “evolutionary  tree.”  To  put  it  in 
somewhat  extreme  terms,  can  one  explain  the  obvious  morphological  differences 
between two microorganisms (e.g. between yeast and E. coli) in terms of their having
different  protein  folds?  Alternatively,  it  may  be  that  most  folds  occur  in  every  organism 
in  the  same  way  that  the  genetic  code  and  many  basic  biochemical  pathways  (such  as 
glycolysis)  are  almost  universally  shared.  Currently,  it  is  only  possible  to  answer  this 
question anecdotally, in terms of individual structures. One can find evidence for either
viewpoint.  On  one  hand,  the  immunoglobulin  fold,  which  is  usually  closely  associated 
with  the  eukaryotes  (e.g.  in  the  vertebrate  immune  system),  has  been  found  in  bacteria, 
where  it  carries  out  a  very  different  function  [14].  On  the  other  hand,  the  small  DNA- 
binding  fold  known  as  the  zinc  finger  so  far  appears  to  be  confined  to  eukaryotes  [15]. 

This  question  can  be  rephrased  as,  "At  what  structural  resolution  do  organisms 
differ?"  Structurally,  microorganisms  appear  different  on  the  micron  scale,  as  they  have 
different  internal  cell  structures,  but  on  the  scale  of  single  Angstroms  they  appear  nearly 
the  same,  containing  similar  proportions  of  C,  H,  O,  N,  P,  and  S  atoms  (Fig.  1).  At  what 
structural resolution can one start seeing differences? It is probably not at the level of
secondary structure (~10 Å) since all organisms are composed of essentially similar
proportions of alpha helices and beta sheets (see below). Is it at the level of protein
super-secondary structure (e.g. four-helix bundles or beta-alpha-beta units) or at the level of
whole  domain  folds?  Or  perhaps  it  is  at  a  higher  level,  involving  the  large-scale 
organization  and  regulation  of  essentially  identical  protein  parts. 

This  question  is  especially  interesting  when  one  considers  the  diverse  physical 
environments  inhabited  by  these  organisms  —  from  high  temperature  and  pressure  for 
Methanococcus, to normal temperature and pressure for yeast, to high acidity for
Helicobacter. 

A  Structural  Census  as  a  Particular  Type  of  “Occurrence 
Analysis”  in  Genomics 

Analyzing  the  occurrence  or  frequency  of  folds  in  genomes  is  a  particular 
example  of  a  general  type  of  comparative  genomics  we  dub  “occurrence  analysis.”  This 
involves  comparing  how  often  a  particular  entity  (e.g.  a  sequence  motif)  occurs  in  various 
genomes,  and  seeing  what  fraction  of  a  collection  of  entities  occurs  in  one  genome  as 
compared  to  another.  Several  different  types  of  occurrence  analysis  have  been  previously 
performed,  studying  genomes  at  many  different  levels. 

Starting  from  the  most  basic  units,  genomes  have  been  compared  in  terms  of  the 
relative  frequencies  of  short  oligonucleotide  and  oligopeptide  “words”  [16-19]. 

On  the  level  of  individual  genes  or  proteins,  the  degree  of  gene  duplication  in  a 
number  of  genomes  has  been  ascertained  [20-25].  Other  works  have  investigated  the 
occurrence of conserved families in several different genomes [26-30]. This can be
performed on a large scale in a highly automated fashion [31-36]. The recent growth of
databases  makes  such  automatic  and  objective  systems  highly  desirable.  In  particular, 
with  the  data  of  many  complete  genomes  now  available,  the  often  arbitrary  functional 
assignment  of  homologous  genes  can  be  replaced  with  a  system  of  orthologs  and  paralogs 
(genes  with  a  common  ancestor,  separated  by  speciation  and  presumably  performing  the 
same function, versus genes generated by duplication within the same organism). A
semi-automatic approach was recently developed that compared several genomes and derived
clusters of orthologous groups (COGs) [28]. The approach is straightforward: If one
knows  all  the  potential  candidates  in  a  genome  for  a  certain  protein  function,  one  can  pick 
the  best  one  based  on  the  best  match  to  a  protein  of  known  function.  If  the  best  matches 
occur  consistently  among  the  same  group  of  proteins  from  several  distantly  related 
genomes, the proteins are classified as a COG.
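
The consistency requirement can be illustrated with a minimal sketch. It assumes three genomes labelled "A", "B" and "C" and a precomputed table of best matches; the function name, the table layout and the gene identifiers are hypothetical, and this simplified triangle test is not the published COG procedure [28].

```python
def consistent_triangles(genes_a, best_hit, a="A", b="B", c="C"):
    """Return (gene_a, gene_b, gene_c) triples whose best matches all agree.

    best_hit[(gene, genome)] is assumed to hold the best-scoring match of
    `gene` in `genome` (a hypothetical, precomputed lookup table).
    """
    triangles = []
    for ga in genes_a:
        gb = best_hit.get((ga, b))
        gc = best_hit.get((ga, c))
        if gb is None or gc is None:
            continue  # no candidate in one of the other genomes
        mutual = (
            best_hit.get((gb, a)) == ga
            and best_hit.get((gc, a)) == ga
            and best_hit.get((gb, c)) == gc
            and best_hit.get((gc, b)) == gb
        )
        if mutual:
            triangles.append((ga, gb, gc))
    return triangles


# Made-up identifiers, for illustration only.
example_hits = {
    ("a1", "B"): "b1", ("a1", "C"): "c1",
    ("b1", "A"): "a1", ("b1", "C"): "c1",
    ("c1", "A"): "a1", ("c1", "B"): "b1",
    ("a2", "B"): "b1", ("a2", "C"): "c2",
    ("c2", "A"): "a2", ("c2", "B"): "b1",
}
print(consistent_triangles(["a1", "a2"], example_hits))
```

In this toy table the gene a2 is rejected because its best match b1 points back to a1 rather than to a2.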

An important application of single-gene occurrence analysis is “differential
genomics.”  When  two  closely  related  genome  sequences  are  compared,  the  difference, 
i.e.  those  genes  that  are  present  only  in  one  of  them,  may  give  a  clue  to  the  unique  nature 
of  the  microbe  in  question.  For  example,  a  comparison  between  E.  coli  and  H.  influenzae 
revealed  116  genes  that  are  present  only  in  the  latter  [37].  Differential  genomics  may 
have  useful  applications  for  attacking  microbe-related  diseases  [38,  39],  e.g.  finding 
genes  unique  to  pathogenic  organisms  can  help  in  developing  antibiotics  against  them. 
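
At its simplest, such a comparison is a set difference. The sketch below assumes that the genes of both genomes have already been mapped onto shared family identifiers (the hard part, which requires sequence comparison); the function name and the identifiers are hypothetical.

```python
def genes_unique_to(genome_a, genome_b):
    """Return the family identifiers found in genome_a but not in genome_b."""
    return set(genome_a) - set(genome_b)


# Made-up family identifiers, for illustration only.
pathogen = {"famA", "famB", "famC", "famD"}
relative = {"famA", "famC", "famE"}
print(sorted(genes_unique_to(pathogen, relative)))
```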

Occurrence  analysis  can  also  be  carried  out  on  the  level  of  whole  metabolic 
pathways  and  systems  [40-42].  This  work  has  yielded  many  interesting  conclusions  in 
terms  of  the  pathways  that  are  modified  or  absent  in  certain  organisms.  For  instance, 
many  of  the  respiratory  enzymes  in  E.  coli  are  missing  in  H.  influenzae,  and  the 
metabolism  in  the  latter  seems  to  be  biased  to  a  relatively  nitrogen-rich  and  anaerobic 
environment  [4,  43,  44]. 

Why  Analysis  of  Structure  is  Particularly  Advantageous  for 
Genome  Comparison 

The  analysis  of  structure  is  expected  to  be  particularly  advantageous  for  genome 
comparison  for  two  reasons. 

Structural  Modules  are  Precisely  Defined  and  Relatively  Few  in  Number 

First,  structure  allows  one  to  define  a  protein  module  (or  shared  part)  in  both  a 
more  precise  and  more  general  sense. 

It  is  possible  (and  quite  productive)  to  define  modules  purely  in  terms  of 
conserved  “blocks”  in  sequence  alignments  or  small,  but  distinctive,  “motifs”  shared  by 
many  related  proteins  [45-58].  However,  functioning  protein  modules  fundamentally 
consist  of  units  of  3D  structure.  In  fact,  it  is  usually  believed  that  these  structural  units 
form  physically  interacting  "folding  domains,"  and  attempts  have  been  made  to  see  how 
well  they  correspond  to  exon  boundaries  and  other  linear  sequence  features  [59-61].  This 
is  often  not  a  simple  relationship  as  many  structural  modules  are  discontinuous  in  terms 
of  sequence  —  as  when  a  polypeptide  chain  starts  in  one  domain,  goes  through  a  hinge 
region  into  a  second  domain,  and  then  returns  to  the  first  domain.  Nevertheless,  relating 
modules  defined  on  the  sequence  level  to  structure  enables  them  to  be  better 
characterized.  This  is  especially  true  for  groups  of  aligned  structures,  which  allow  the 
definition  of  a  conserved  structural  core  [62,  63]. 

Also,  one  expects  analysis  of  structure  to  reveal  more  about  distant  evolutionary 
relationships  than  sequence  comparison,  since  structure  is  more  conserved  than  sequence 
or  function  [64,  65].  In  other  words,  it  is  at  the  level  of  protein  structure  where  the 
biologist sees the fewest “parts” and the greatest amount of redundancy and reuse.

Similarity  in  Sequence  is  More  Closely  Related  to  Similarity  in  Structure  than  in 
Function 

A  second  reason  that  structural  analysis  is  useful  for  genome  comparisons  is  that  the 
relationship  between  sequence  similarity  and  structural  similarity  is  much  better  defined 
than  the  corresponding  relationship  between  sequence  and  function. 

It  is  generally  accepted  that  proteins  with  similar  sequences  usually  have  similar 
structures. A decade ago Lesk & Chothia systematically investigated the relationship
between divergence in sequence and that in structure [64, 66]. Using the limited amount
of  data  available  at  the  time  (32  pairs  of  homologous  structures  among  25  proteins),  they 
found  that  the  extent  of  the  structural  changes  is  directly  related  to  the  extent  of  the 
sequence  changes.  As  shown  in  figure  2,  we  have  repeated  the  calculations  here  using  a 
much  larger  data  set.  (Details  of  the  calculations  are  described  in  the  legend.)  Expressing 
sequence similarity in terms of the more modern statistical terminology (i.e. P-value
instead  of  percentage  identity),  we  find  very  similar  results  to  the  original  work  of  Lesk 
&  Chothia.  There  are,  of  course,  exceptions  where  similarity  in  sequence  does  not  imply 
similarity  in  structure.  These  usually  occur  for  small  proteins,  e.g.  an  artificially  designed 
sequence  of  a  four-helix  bundle  could  be  made  more  than  50%  identical  to  a 
predominantly  beta-sheet  protein  [67,  68]. 

The  relationship  between  sequence  similarity  and  functional  similarity  is  much  less 
clear  [69].  In  part,  this  is  because  it  is  much  more  difficult  to  precisely  specify  a  function 
than  a  sequence  or  a  structure.  Moreover,  even  in  cases  where  the  functional 
identification  is  well  specified,  there  are  several  examples  where  highly  similar  sequences 
have  completely  different  functions  -  i.e.  same  fold  but  different  function.  A  well-known 
example is the structural protein eye-lens crystallin and the metabolic enzyme glutathione
S-transferase [70], which have sequence and structural similarity but differ in function.

An  extreme  example  is  provided  by  the  enzymes  lactate  dehydrogenase  and  malate 
dehydrogenase.  In  protein  engineering  experiments,  Wilks  et  al.  managed  to  convert  one 
into  the  other  by  changing  only  a  single  amino  acid  [71]. 

The  opposite  situation  can  also  be  observed,  namely  when  the  same  function  is 
performed  by  several  proteins  unrelated  in  structure  and  sequence  -  i.e.  same  function  but 
different  fold.  A  good  example  is  chloroperoxidase,  which  has  an  alpha/beta  fold  in  the 
prokaryote  Pseudomonas  but  has  an  all-alpha  fold  in  fungi  [72,  73].  There  are  many 
more  examples  of  this  type  of  convergent  evolution  in  enzymes  [74]. 

Elements  of  a  Structural  Census:  Construction  of  a  Fold  Library 

Thus  far,  we  have  described  how  comparing  genomes  in  terms  of  structures  is  a 
particular  form  of  “occurrence  analysis”  and  how  structure  provides  a  particularly 
advantageous  subject  for  comparison.  Now  we  outline  what  goes  into  a  structure  census, 
its  methodological  "elements,"  and  discuss  some  conclusions  from  recent  work.  An 
essential  element  in  a  survey  of  known  structures  is  the  construction  of  a  library  of  folds. 
This  is  expected  to  be  an  essential  data  structure  in  molecular  biology,  organizing  the 
collection  of  gene  families  like  the  columns  in  the  chemical  periodic  table  [75]. 

Pairwise Structural Comparison and Alignment: Automatic vs. Manual

To  build  a  fold  library,  one  must  have  a  way  of  comparing  and  aligning  protein 
structures  (see  figure  3).  One  approach  is  to  do  this  manually,  the  approach  taken  for  the 
scop classification of protein structures [7]. At the other extreme, there are a number of
algorithms  for  automatically  comparing  structures  and  clustering  them  into  fold  families 
[76-89].  Finally,  there  is  a  hybrid  approach,  based  on  both  automatic  and  manual 
comparison  [10,  90]. 

Completely  automatic  methods  have  the  advantage  of  speed  and  objectivity. 
However,  the  fold  classifications  produced  by  a  computer  are  not  always  as 
understandable  or  reliable  as  those  produced  by  humans.  Furthermore,  although  manual 
classification  is  slow,  if  it  is  done  correctly,  it  only  has  to  be  done  once. 

Various Automatic Methods for Structural Comparison

To  get  a  perspective  on  the  automatic  methods,  it  is  useful  to  compare  structural 
alignment  with  the  much  more  thoroughly  studied  methods  for  sequence  alignment  [91, 
92].  Both  methods  produce  an  alignment,  which  can  be  described  as  an  ordered  set  of 
equivalent  pairs  (i,j)  associating  residue  i  in  protein  A  with  residue  j  in  protein  B.  Both 
methods allow gaps in these alignments, which correspond to non-sequential i (or j)
values in consecutive pairs; i.e., one has pairs like (10, 20) and (11, 22). And both
methods  reach  an  alignment  by  optimizing  a  function  that  scores  well  for  good  matches 
and  badly  for  gaps.  The  major  difference  between  the  methods  is  that  the  optimization 
used  for  sequence  alignment  is  globally  convergent  whereas  that  used  for  structural 
alignment  is  not.  This  is  the  case  for  sequence  alignment  because  the  optimum  match  for 
one  part  of  a  sequence  is  not  affected  by  the  match  for  any  other  part.  Structural 
alignment  fails  to  converge  globally  because  the  possible  matches  for  different  segments 
are  tightly  coupled,  as  they  are  part  of  the  same  rigid  3D  structure. 
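
The sequence case can be made concrete with a minimal sketch of global alignment by dynamic programming, returning the alignment as the ordered list of equivalent pairs (i, j) described above. The scoring values (match, mismatch and a linear gap penalty) are arbitrary assumptions chosen for illustration, not those of any particular program, and the indices are 0-based.

```python
def align(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Globally align two sequences; return (score, ordered list of (i, j) pairs)."""
    n, m = len(seq_a), len(seq_b)
    # best[i][j] is the best score for aligning seq_a[:i] with seq_b[:j].
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            best[i][j] = max(best[i - 1][j - 1] + s,   # pair residue i with j
                             best[i - 1][j] + gap,     # gap in seq_b
                             best[i][j - 1] + gap)     # gap in seq_a
    # Trace back from the corner to recover the equivalent pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


print(align("ACGT", "AGT"))
```

Each cell of the table depends only on three neighbouring cells, which is why the optimum found this way is global. No comparable decomposition exists for rigid 3D structures, where moving one segment moves all the others.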

This lack of convergence has led to a large number of different approaches to
structural  alignment,  the  methods  differing  in  how  they  attack  the  problem.  No  current 
algorithm  works  all  of  the  time  (i.e.  for  all  the  pathological  cases).  The  methods  also 
differ  in  the  function  they  optimize  (the  equivalent  of  the  amino  acid  substitution  matrix 
used  in  sequence  alignment)  and  how  they  treat  gaps.  Some  of  the  methods  effectively 
compare  the  respective  distance  matrices  of  each  structure,  trying  to  minimize  the 
difference in intramolecular (within-structure) distances for selected aligned substructures [80, 83, 93]. Other
approaches,  in  contrast,  directly  try  to  minimize  the  inter-atomic  distances  between  two 
structures,  using  repeated  application  of  dynamic  programming  [77,  89,  90,  94,  95].  This 
allows  structures  to  be  aligned  in  a  similar  fashion  to  normal  sequence  alignment  [96].  A 
similar  approach  is  taken  in  minimizing  the  "soap-bubble  area"  between  two  structures 
[87].  Other  methods  involve  other  techniques,  such  as  geometric  hashing  or  lattice  fitting 
[79, 85, 86].

Fusing  a  Multiple  Alignment  into  a  Structural  Template 

The  classification  of  the  entire  databank  using  a  variety  of  the  automatic  and 
manual  procedures  outlined  above  has  recently  been  undertaken  by  a  number  of  groups 
[7,  83,  97-101],  resulting  in  the  scop,  FSSP,  LPFC,  CATH,  and  HOMALDB  databases. 
These databases group the known structures into ~350 fold families, some of which are
quite  large  (e.g.  currently  the  PDB  contains  over  166  antibody  structures).  Because  of  the 
great  numbers  of  structures  and  of  families,  it  is  worthwhile  to  summarize  the  common 
features  within  a  family,  whilst  separating  out  the  variable  ones.  That  is,  one  wants  to 
know  which  regions  are  conserved  and  which  are  highly  variable,  and  to  fuse  all  the 
conserved  regions  into  a  single  “core  structure”  template  (figure  3).  A  number  of 
approaches  have  been  developed  to  tackle  this  problem  through  determining  a  mean  and 
variance  for  an  ensemble  of  multiply  aligned  structures  and  then  picking  the  low  variance 
atoms  as  “core”  [8,  62,  102,  103]. 
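
The mean-and-variance idea can be sketched as follows. The sketch assumes that the structures have already been multiply aligned and superimposed, so that each is a list of (x, y, z) coordinates with one entry per aligned position; the function name, the variance cutoff and the coordinates are hypothetical and do not reproduce any of the cited methods.

```python
def core_positions(structures, max_variance):
    """Return indices of aligned positions whose positional variance is low.

    structures: list of superimposed structures, each a list of (x, y, z)
    tuples with one tuple per aligned position.
    """
    n_struct = len(structures)
    core = []
    for p in range(len(structures[0])):
        coords = [s[p] for s in structures]
        mean = [sum(c[k] for c in coords) / n_struct for k in range(3)]
        # Mean squared distance of this position from its average location.
        variance = sum(
            sum((c[k] - mean[k]) ** 2 for k in range(3)) for c in coords
        ) / n_struct
        if variance <= max_variance:
            core.append(p)
    return core


# Made-up coordinates: the third position diverges between the two structures.
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.2), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0)],
]
print(core_positions(ensemble, max_variance=0.5))
```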

Searching the Genome with Structural Templates

Clustering  the  Structure  Databank  into  Sequence  Templates 

Once  a  library  of  folds  has  been  constructed,  one  wants  to  build  sequence 
templates  based  on  it  and  then  use  these  to  search  the  genome.  A  necessary 
methodological  preliminary  is  clustering  the  known  structures  into  a  number  of 
(sequence)  representative  domains,  using  a  variety  of  single  or  multiple  linkage 
approaches [6, 67, 104-106]. Currently, the PDB can be clustered into ~1200 representative
domains.  Then  using  structure  comparison,  one  finds  that  these  representatives  are 
distributed  amongst  338  folds,  giving  about  three  sequence  families  per  fold  [6].  The  fact 
that  the  number  of  folds  is  so  much  less  than  the  number  of  sequence  families  highlights 
the  fact  that  many  of  the  evolutionary  similarities  between  highly  diverged  organisms 
may only be apparent in terms of structure [107]. Folds can, in turn, be ranked by the
number of different families of non-homologous sequences they are associated with.

Folds  uniting  many  distinct  sequence  families  have  been  dubbed  superfolds  [108].  These 
may  represent  intrinsically  stable  and  favorable  structural  arrangements,  as  suggested  by 
a  variety  of  analyses  [108-110]. 
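
Both steps, clustering into representatives and ranking folds by the number of sequence families they unite, can be sketched briefly. The sketch assumes single linkage over a precomputed list of significantly similar domain pairs; the function names, domain identifiers and fold labels are hypothetical.

```python
from collections import Counter


def single_linkage(items, similar_pairs):
    """Group items so that any two linked by a chain of similar pairs share a cluster."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the path as we go
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(clusters.values())


def rank_folds(family_to_fold):
    """Rank folds by how many sequence families map to each of them."""
    return Counter(family_to_fold.values()).most_common()


# Made-up data, for illustration only.
domains = ["d1", "d2", "d3", "d4", "d5"]
print(single_linkage(domains, [("d1", "d2"), ("d2", "d3")]))
print(rank_folds({"fam1": "foldX", "fam2": "foldX", "fam3": "foldX", "fam4": "foldY"}))
```

Under single linkage d1 and d3 end up together even though they were never compared directly, which is the behaviour that distinguishes it from stricter multiple-linkage schemes.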

At this point one has ~350 3D-structural alignments, each of which “connects” a
number  of  non-homologous  sequences.  These  can  be  used  as  “seeds”  to  build  up  large 
sequence  alignments  from  the  major  databases  using  standard  pairwise  searching  tools  - 
e.g.  the  popular  BLAST  and  FASTA  programs  on  the  SwissProt  and  GenBank  databases 
[111-115].  A  number  of  recently  developed  methods  of  transitive  sequence  matching 
(through  a  third  intermediate  sequence)  are  expected  to  improve  the  sensitivity  of  these 
pairwise  searches  somewhat  [116-119]. 

As  many  of  these  alignments  contain  quite  a  few  sequences,  it  can  be  advantageous  to 
fuse  them  into  a  consensus  pattern  or  template,  just  as  is  done  with  structures  [62]  (Fig. 
3). For this, a variety of probabilistic approaches can be used. One of the most popular
representations is the Hidden Markov Model (HMM) [120-125]. This is a generalization
of  the  sequence  profile,  and  like  a  profile  it  gives  an  explicit  probability  for  each  of  the  20 
amino  acids  to  occur  at  each  position  in  the  model  [126].  The  HMM  goes  beyond  a 
profile  in  associating  with  each  position  an  explicit  probability  for  introducing  a  gap 
(either  for  insertion  or  deletion). 
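
The profile part of this description can be sketched in a few lines. The sketch assumes an alignment given as equal-length strings with '-' marking gaps and uses a uniform pseudocount, which is an illustrative choice rather than the weighting of any cited method. It does not model the position-specific gap probabilities that distinguish an HMM from a profile.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one {amino acid: probability} table per alignment column."""
    profile = []
    for column in zip(*alignment):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for residue in column:
            if residue in counts:  # gap characters such as '-' are skipped
                counts[residue] += 1
        total = sum(counts.values())
        profile.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return profile


# Made-up three-column alignment, for illustration only.
toy_profile = build_profile(["ACD", "ACE", "A-D"])
print(round(toy_profile[0]["A"], 3))
```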

Microbial  Genome  Sequences 

Once  formed,  sequence  templates  can  be  compared  directly  against  the  genomes. 
This  can  take  place  in  a  variety  of  ways.  The  most  straightforward  is  to  just  compare  each 
sequence  in  the  template  against  the  genome  using  the  standard  pairwise  comparison 
programs (e.g. FASTA, BLAST, or straight Smith-Waterman [111-113, 127]).
Alternatively, one can use profile or HMM searching programs for those sequences that are
part  of  an  explicit  pattern.  However,  in  doing  this  one  has  to  consider  some  important 
issues  related  to  bias  (see  below). 

At the time of this writing there are 13 microbial genome sequences
available (Table 1). These already provide a most diverse comparison, representing
microbes from the three kingdoms of life (Eukarya, Eubacteria, Archaea), from different
environments  (room  temperature  and  pressure  to  high  temperature  and  pressure,  and 
neutral  pH  to  highly  acidic),  with  a  wide  range  of  genome  sizes  (0.6  to  13  Mb),  and  with 
a variety of modes of life (from parasite to autotroph).

One  point  worth  mentioning  is  that  the  genome  data  is  constantly  changing  and  is 
contingent  on  the  current  “state  of  the  art”  in  gene  finding.  The  data  used  in  any  analysis 
reflects  a  particular  snapshot  of  this  ongoing  process.  For  instance,  the  current  E.  coli 
data  file  is  version  M52,  containing  4290  ORFs.  This  is  a  more  recent  version  and 
contains  a  different  number  of  ORFs  than  the  one  referred  to  in  the  official  publication 
(M49,  containing  4288  ORFs)  [128].  For  yeast  there  is  some  uncertainty  regarding 
whether  all  of  the  ORFs  in  the  web  site  file  are  really  genes.  In  particular,  5888  of  the 
6218  ORFs  are  definitely  believed  to  be  genes,  but  there  is  some  question  about  the 
remaining  330  [129].  Furthermore,  quite  a  number  of  yeast  sequences  (initially) 
annotated  to  be  ORFs  are,  in  fact,  transposons,  which  should  properly  be  segregated  from 
the  rest  of  the  proteome  [130]. 

Similarity  in  Both  Sequence  and  Structure  is  Best  Described 
Statistically 

Similarities  are  best  expressed  statistically  in  terms  of  a  P-value 

The  preceding  section  was  concerned  with  comparison,  both  for  structure  and 
sequence.  To  do  this  right,  one  needs  to  be  able  to  assess  the  significance  of  a  given 
comparison  score  -  i.e.  what  does  a  score  of  392  mean?  This  is  often  quite  subtle  and,  in 
a  sense,  relates  to  the  fundamental  problem  of  what  constitutes  similarity  in  biology. 
Moreover,  it  is  a  most  important  issue  with  respect  to  large-scale  genome  surveys,  which 
involve  hundreds  of  thousands  of  comparisons.  It  is  essential  to  have  a  rapid  and 
automatic  method  to  assess  the  significance  of  a  given  comparison  score  (i.e.  to  set  a 
threshold),  as  it  is  neither  possible  nor  desirable  to  do  this  by  hand. 

The  best  way  to  assess  significance  is  to  see  how  a  particular  similarity  score  compares 
in  a  statistical  sense  to  all  the  others.  A  major  development  in  the  past  few  years  has  been 
the  implementation  of  probabilistic  scoring  schemes  for  doing  just  this  [131-137].  These 
give  the  significance  of  a  match  in  terms  of  a  P-value  rather  than  an  absolute,  “raw”  score 
(such  as  percent  identity  or  RMS).  A  P-value  is  the  chance  that  one  would  get  a  given 
similarity  score  (or  better)  from  a  random  alignment.  That  is,  P(s  >  S)  =  .01  means  that  a 
randomly  generated  score  s  would  be  greater  than  the  threshold  score  S  (e.g.  392)  1%  of  the 
time.  The  P-value  gives  the  rank  of  a  score  relative  to  all  the  other  possible  scores.  It  places 
scores  from  very  different  programs  in  a  common  framework  and  provides  an  obvious 
way to set a significance cutoff (e.g. at P < 0.0001, i.e. 0.01%).

P-values  are  closely  related  to  another  quantity  called  the  e-value,  which  is  the 
number  of  false  positives  expected  with  a  given  score  threshold  in  a  whole  databank 
comparison.  Thus,  the  e-value  is  just  the  databank  size  multiplied  by  the  P-value. 
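
Both quantities can be illustrated with a minimal sketch. It assumes that a sample of scores from random alignments is available and estimates the P-value simply as the fraction of those scores at or above the observed one; the function names and the numbers are hypothetical.

```python
def empirical_p_value(score, random_scores):
    """Fraction of random-alignment scores that are at least as good as `score`."""
    hits = sum(1 for s in random_scores if s >= score)
    return hits / len(random_scores)


def e_value(p_value, databank_size):
    """Expected number of false positives in a search of the whole databank."""
    return p_value * databank_size


# Made-up background scores 0..99, for illustration only.
background = list(range(100))
p = empirical_p_value(99, background)
print(p, e_value(p, 5000))
```

An empirical estimate of this kind cannot resolve P-values smaller than one over the number of random scores sampled, which is one reason for fitting an analytic distribution instead.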

Determination  of  P-values  involves  determining  the  score  distribution  for  true 
negatives,  i.e.  for  random  alignments.  This  can  be  done  in  a  number  of  ways:  simulating 
random  alignments,  analytically  deriving  the  score  distribution  for  a  random  alignment,  or 
doing  an  all-vs-all  comparison  of  the  databank  and  curve-fitting  to  the  observed  score 
distribution. 

Statistics  for  Sequence  Similarity 

For sequences, P-values were first used in the BLAST family of sequence searching
programs,  where  they  are  derived  from  an  analytic  model  for  the  chance  of  an  arbitrary 
ungapped alignment [131, 135]. P-values have subsequently been implemented in other
programs  such  as  FASTA  and  gapped  BLAST  using  a  somewhat  different  formalism 
[116,  136-138].  In  all  the  formalisms,  P-values  for  sequence  comparison  are  derived  from 
an  extreme  value  distribution.  That  is,  sequence  comparison  scores  are  observed  to  follow 
a distribution like exp(-S-exp(-S)), which has a much longer "tail" than the rapidly falling
off normal distribution exp(-S^2). Such a distribution arises naturally from repeatedly
considering  the  maximum  of  a  number  of  independent,  random  variables.  This  is  in 
contrast  to  the  normal  distribution,  which  arises  from  repeatedly  considering  sums  of 
random  variables. 
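
The contrast between maxima and sums can be checked with a small simulation. The sketch below is illustrative only: the choice of 50 standard normal draws per trial, 10,000 trials, and the use of sample skewness as a measure of the long right-hand tail are all assumptions made here.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the third standardized moment."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(1)
maxima, sums = [], []
for _ in range(10_000):
    draws = [rng.gauss(0.0, 1.0) for _ in range(50)]
    maxima.append(max(draws))  # extreme-value behaviour
    sums.append(sum(draws))    # normal behaviour

skew_max = skewness(maxima)
skew_sum = skewness(sums)
print(f"skewness of maxima: {skew_max:.2f}")
print(f"skewness of sums:   {skew_sum:.2f}")
```

The sums are symmetric about their mean, whereas the maxima are skewed toward high values, which is the long tail referred to above.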

In  general,  P-values  give  similar  results  to  more  conventional  scores,  such  as 
percent  identity,  but  they  have  been  shown  to  be  better  calibrated  and  more  sensitive  for 
marginal  similarities,  taking  into  account  compositional  biases  of  the  databank  and  the 
query  sequence  [94,  132,  133].  In  particular,  Brenner  et  al.  tested  the  applicability  of 
probabilistic  scores  to  the  detection  of  structural  relationships  [67,  139,  140].  They  found 
that  the  FASTA  e-value  closely  tracked  the  error  rate  against  a  test  set  of  known 
structural  relationships.  That  is,  with  regard  to  the  number  of  false  positives,  expectation 
tracked  reality. 

Statistics  for  Structural  Similarity 

Some  of  the  current  methods  for  structural  alignment  have  associated  with  them 
probabilistic  scoring  schemes.  In  particular,  one  method  computes  a  P-value  for  an 
alignment  based  on  measuring  how  many  secondary  structure  elements  are  aligned,  as 
compared  to  the  chance  of  aligning  this  many  elements  randomly  (VAST)  [86].  Another 
method  expresses  the  significance  of  an  alignment  in  terms  of  the  number  of  standard 
deviations it scores above the mean alignment score in an all-vs-all comparison (i.e., a
Z-score) [8, 83].
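
For concreteness, a Z-score of this kind can be computed as in the following sketch. The background scores are hypothetical, and the function name is ours rather than that of any published program.

```python
import statistics


def z_score(score, background_scores):
    """Number of standard deviations by which `score` exceeds the mean of
    the background (all-vs-all) score distribution."""
    mean = statistics.fmean(background_scores)
    sd = statistics.pstdev(background_scores)
    return (score - mean) / sd


# Hypothetical background scores from an all-vs-all comparison.
background = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(z_score(11.0, background))
```

Note that a Z-score implicitly treats the background as roughly normal, so for long-tailed score distributions it is not directly interchangeable with a P-value.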

We  have  recently  developed  a  simple  empirical  approach  for  calculating  the 
significance  of  a  structural  alignment  score  based  on  doing  an  all-vs-all  comparison  of 
the  databank  and  then  curve  fitting  to  the  observed  score  distribution  for  the  true 
negatives  [90,  94].  We  can  apply  our  approach  consistently  to  both  sequences  and 
structures.  For  sequences,  we  compared  our  fit-based  P-values  with  the  differently 
derived  statistical  scores  from  commonly  used  programs  such  as  BLAST  and  FASTA  and 
found  substantial  agreement.  For  structure  alignment,  we  follow  a  parallel  route  to  derive 
an  expression  for  the  P-value  of  a  given  alignment  in  terms  of  a  structural  alignment 
score. 

We  find  that  scores  from  structure  alignment  follow  a  similar  extreme-value 
distribution  to  those  in  sequence  comparison,  allowing  one  to  adopt  a  uniform  statistical 
formalism  for  both  comparison  techniques.  (As  dynamic  programming  applied  to  either 
sequence  or  structure  alignment  essentially  finds  a  maximum  score  over  many  possible 
alignments,  it  is  quite  reasonable  that  this  should  be  the  case.  However,  this  is  not 
trivially  obvious,  as  the  dynamic  programming  score  does  not  result  from  considering  the 
maximum  of  truly  independent  variables.) 

A nice aspect of structural alignment is that one can visualize exactly what is meant
by a strong similarity in comparison to a marginal one. Examples are shown in Figure 4,
which presents a strong similarity (for two globins), a weaker one (for two
immunoglobulins), and a very marginal one.




Overall  “Inventory”  Statistics  in  a  Census  Calculation 

Distribution  of  Folds  Amongst  Genomes  (Venn  Diagrams) 

After  setting  a  uniform  comparison  threshold  and  running  the  fold  library  against 
the  genomes,  it  is  possible  to  see  how  the  known  folds  are  distributed  amongst  different 
genomes,  or  partial  genomes.  There  are  a  number  of  web  sites  that  compile  this  data 
automatically  -  e.g.  PENDANT  and  GeneQuiz  [33,  141].  However,  few  detailed  analyses 
have  been  published,  mostly  because  only  recently  have  enough  complete  genomes 
become  available  for  this  sort  of  comparative  analysis. 

A  recent  work  illustrates  what  is  initially  possible  [24].  This  analysis  focussed  on 
three  of  the  first  genomes  to  be  sequenced,  the  first  ones  from  each  of  the  major 
kingdoms: i.e., H. influenzae (a eubacterium, [4]), M. jannaschii (an archaeon, [142]), and
S.  cerevisiae  (yeast,  a  eukaryote  [129]). 

As  shown  in  Figure  5,  the  analysis  can  be  conceptualized  in  terms  of  a  Venn  diagram, 
similar  to  those  used  for  studying  the  occurrence  of  motifs  and  sequence  families  [143, 
144].  About  half  of  the  known  folds  (148)  are  contained  in  at  least  one  of  the  three 
genomes,  and  45  folds  are  shared  amongst  all  three  genomes.  These  shared  folds 
presumably  represent  an  ancient  set  of  molecular  parts. 
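
Once each genome has been reduced to the set of folds matched in it, the Venn-diagram counts follow from ordinary set operations. The sketch below uses three small hypothetical genomes with placeholder fold labels; it does not reproduce the actual survey data.

```python
def venn_regions(a: set, b: set, c: set) -> dict:
    """Count folds in each of the seven regions of a three-set Venn diagram."""
    return {
        "A only": len(a - b - c),
        "B only": len(b - a - c),
        "C only": len(c - a - b),
        "A and B only": len((a & b) - c),
        "A and C only": len((a & c) - b),
        "B and C only": len((b & c) - a),
        "all three": len(a & b & c),
    }


# Hypothetical fold assignments (placeholder labels, not real survey results).
genome_a = {"fold1", "fold2", "fold3", "fold4"}
genome_b = {"fold1", "fold2", "fold3", "fold5"}
genome_c = {"fold1", "fold2", "fold3", "fold4", "fold6"}

regions = venn_regions(genome_a, genome_b, genome_c)
print(regions)
print("folds in at least one genome:", len(genome_a | genome_b | genome_c))
```

The seven region counts always sum to the number of folds found in at least one genome, which provides a simple consistency check on a published diagram.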

It  is  possible  to  classify  each  fold  as  all-alpha,  all-beta,  alpha/beta,  alpha+beta,  or 
“other”  using  the  original  definitions  of  Levitt  &  Chothia  and  then  to  see  how  the  folds 
corresponding  to  each  structural  class  are  distributed  among  the  genomes  [145,  146]. 
Overall, the genomes contain a disproportionate number of mixed folds (α/β and
α+β, 83/148), and the shared folds are even more enriched in α/β super-secondary
structures, with 38 of 45 having a mixed architecture.

A  related  analysis  looked  at  the  occurrence  of  folds  in  different  groups  of  organisms 
(e.g.  plants  vs.  animals)  [147].  This  did  not  involve  complete  genomes  but  rather 
partitioning  the  sequence  databank  into  a  number  of  distinct  phylogenetic  sets.  Such  an 
analysis  suffers  from  various  biases  (as  discussed  below),  but  it  is  nevertheless 
suggestive,  showing  that  more  closely  related  organisms  had  a  greater  number  of  folds  in 
common. 

It  is  expected  that  many  more  analyses  such  as  these  will  be  undertaken  in  the 
future  as  more  genomes  are  sequenced  and  structures  determined  [148].  It  is  difficult  to 
express  the  shared  folds  amongst  more  than  three  genomes  in  terms  of  a  Venn  diagram, 
so  other  representations  become  useful,  such  as  cluster  trees  [149]. 

Frequency  that  Folds  Occur  in  a  Genome  (“Top-10  lists”) 

Another  simple  statistic  to  look  at  is  how  often  a  particular  known  fold  occurs  in  a 
genome,  i.e.  the  fold  frequency.  In  the  previous  work  comparing  three  genomes,  these 
frequencies were expressed in terms of “top-10” lists for the most common folds in a
genome [24]. As was the case for the folds overall, most of the common folds have an
α/β architecture.

Combining  the  frequent  fold  analysis  with  the  Venn  diagram,  one  can  determine 
the  common  folds  that  are  shared  by  all  genomes.  As  shown  in  figure  6,  ordered  in  terms 
of  their  frequency  of  occurrence,  the  top-five  common  and  shared  folds  when  comparing 
yeast,  Haemophilus  influenzae,  and  Methanococcus  jannaschii  are  the  P-loop  containing 
NTP  hydrolase  fold,  the  Rossmann  fold,  the  TIM-barrel  fold,  the  flavodoxin  fold,  and  the 
Thiamin-binding fold. Each of these folds is associated with basic metabolism (as
opposed to other functions such as transcription or regulation). They are all classic α/β
proteins  and  share  a  remarkably  similar  super-secondary  structure  architecture,  with  a 
central  sheet  of  parallel  strands  with  helices  packed  onto  at  least  one  face  of  this  sheet. 
Moreover,  the  topology  of  the  central  sheet  is  very  similar  in  all  the  proteins.  Almost  all 
of  the  connections  are  right-handed  links  between  adjacent  parallel  strands  through  an 
intervening  helix  packed  onto  the  central  sheet. 

These top-10 lists rank folds by how often they occur in the genome, tending to
emphasize  highly  duplicated  genes.  Folds  can  also  be  ranked  by  a  number  of  other 
criteria.  For  instance,  they  can  be  ranked  by  the  number  of  non-homologous  sequence 
families  they  are  associated  with,  i.e.  their  superfold  ranking.  This  number  is  not  always 
correlated  with  how  often  the  fold  occurs  in  microbial  genomes,  but  it  is  the  case  that 
superfolds  are  among  the  most  common  folds  found  in  genomes.  Folds  can  also  be 
ranked  in  terms  of  expression  level,  essentially  a  ranking  by  mRNA  occurrence  in  the 
cell. This has already been done in non-structural terms for all the genes in yeast
[150-152]. In Table 2, we see how this expression level ranking maps onto folds. Using data
from DeRisi et al. [152], the table shows the most highly expressed folds in yeast grown
in  two  different  conditions  (high  sugar  and  low  sugar,  aerobic  vs.  anaerobic  conditions). 
The  ranking  of  folds  is  clearly  different  from  that  purely  based  on  duplication. 

The  Problem  of  Sampling  Bias  Affects  the  Statistics 

General  Issue  of  Bias  in  the  Databanks 

One  of  the  most  important  issues  in  doing  a  large-scale  survey  is  avoiding  biases. 
Because  of  the  preferences  of  investigators,  some  types  of  sequences  or  structures  are 
over-represented  and  others  are  under-represented  in  the  databanks.  For  instance,  in 
GenBank  there  is  an  over-representation  of  globins  from  humans  relative  to  flies. 
Moreover,  a  particular  fold  may  be  found  in  the  human  but  not  in  the  fly  simply  because 
not  all  the  fly  sequences  are  currently  known.  Focussing  only  on  organisms  for  which 
complete  genomes  are  known  eliminates  this  obvious  form  of  bias.  However,  there  is 
another  bias  that  is  not  overcome  by  knowledge  of  complete  genomes.  The  selection  of 
proteins  in  the  PDB  is  also  biased  by  the  preferences  of  individual  investigators  and  by 
the  physical  constraints  on  what  will  crystallize  (or  can  be  studied  by  NMR 
spectroscopy).  For  instance,  the  PDB  currently  contains  about  5500  entries  (5493 
identifiers  and  10781  domains).  This  total  includes  222  structures  for  T4  lysozyme,  but 
only a single structure for the “equally important” tyrosine kinase and topoisomerase-II
proteins.

Structures  in  the  PDB  are  also  biased  towards  certain  commonly  studied  organisms. 
Thus, a much larger percentage of folds is known for the bacterium Haemophilus in
comparison to the archaeon Methanococcus, even though both have roughly the same
number  of  genes  [24]. 

Another  issue  related  to  the  state  of  the  structure  databank  is  that  the  absolute  counts 
found  in  a  given  genome  survey  are  contingent  on  the  evolving  contents  of  the  databank. 
Thus,  over  time  as  more  structures  are  added  to  the  databank,  one  should  expect  such 
statistics  as  the  most  common  folds  and  number  of  shared  folds  to  change  somewhat. 

The Multi-domain Nature of Proteins Creates Counting Problems

A  second  type  of  bias  has  to  do  with  the  fact  that  protein  structure  is  fundamentally 
arranged  around  the  level  of  folding  domains  whereas  statistics  for  genomes  are  often 
calculated  and  best  understood  in  terms  of  the  number  of  genes  (Fig.  7).  For  instance, 
when  one  talks  about  how  prevalent  the  kinase  and  Rossmann  folds  are  in  the  yeast  and 
E.  coli  genomes,  one  is  implicitly  comparing  the  number  of  matches  that  known  kinase 
and Rossmann fold structures have in the ~6200 yeast ORFs relative to the ~4300 E. coli
ORFs.  However,  it  is  possible  for  a  single  gene  to  contain  a  number  of  kinase  fold 
domains  or  to  simultaneously  contain  both  a  kinase  and  Rossmann  fold.  Thus,  the  total 
number  of  domains  in  a  genome  is  probably  a  better  standard  for  these  comparisons. 
Unfortunately,  one  does  not  know  this  number.  But  one  does  know  that  the  number  of 
domains is not related simply to the number of genes. For instance, on average a protein
is about 50% larger in yeast than in E. coli (466 vs. 317 residues), meaning that there are
probably twice as many possible domains in yeast as in E. coli.
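
The "twice as many" estimate can be reproduced from the numbers quoted above, as in the following sketch. It assumes, as the argument implicitly does, that the number of possible domains scales with the total number of residues encoded by a genome, i.e. that the average domain size is similar in the two organisms.

```python
# ORF counts and average protein lengths (in residues) quoted in the text.
yeast_orfs, yeast_mean_length = 6200, 466
ecoli_orfs, ecoli_mean_length = 4300, 317

# Ratio of average protein sizes ("about 50% larger").
length_ratio = yeast_mean_length / ecoli_mean_length

# Ratio of total encoded residues, used here as a proxy for domain count.
residue_ratio = (yeast_orfs * yeast_mean_length) / (ecoli_orfs * ecoli_mean_length)

print(f"average protein length, yeast / E. coli: {length_ratio:.2f}")
print(f"total encoded residues, yeast / E. coli: {residue_ratio:.2f}")
```

By gene count alone yeast is only about 1.4 times larger than E. coli, so normalizing fold counts by genes rather than by residues would understate the difference.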

Another  problem  emanating  from  the  multi-domain  nature  of  proteins  is  highlighted 
in Figure 7. When clustering genes based on their sequence similarities, simple single-linkage
clustering can give potentially misleading results. As has been pointed out
before, it may group together two multi-domain proteins (AB and bc) containing the two
unrelated domain folds (A and c) based on their having similarity only through a common
domain (B and b) [42, 50].
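
The single-linkage problem is easy to demonstrate on a toy example. In the sketch below, each hypothetical protein is described by the set of domain families it contains, and two proteins are treated as "similar" if they share any family; this stands in for a pairwise sequence comparison and is an assumption of the illustration.

```python
def single_linkage(proteins: dict) -> list:
    """Cluster proteins so that any two sharing at least one domain family
    end up in the same cluster (single linkage), using union-find."""
    parent = {name: name for name in proteins}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    names = list(proteins)
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if proteins[p] & proteins[q]:  # shared domain => similarity link
                parent[find(p)] = find(q)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


# Hypothetical proteins described by their domain families.
proteins = {
    "protein_AB": {"A", "B"},
    "protein_bc": {"B", "C"},
    "protein_A": {"A"},
    "protein_c": {"C"},
}
clusters = single_linkage(proteins)
print(clusters)
```

All four proteins fall into one cluster even though the single-domain proteins with folds A and c have nothing in common, which is why clustering at the level of domains rather than whole genes is preferable.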

Subtle  Biases  in  Comparison  Techniques 

A  final,  rather  subtle  form  of  bias  results  from  the  type  of  sequence  comparison 
method used. Different pairwise comparison methods (e.g. Smith-Waterman vs. FASTA)
and  different  thresholds  will  give  rise  to  different  absolute  numbers  of  fold  counts,  but 
the  relative  values  between  different  folds  will  usually  remain  comparable.  However,  as 
discussed  above,  there  are  other,  potentially  more  sensitive,  methods  of  comparing 
sequences to structures - e.g. profiles, HMMs, motif analysis, and threading [55, 125,
153-155].  These  latter  methods  find  more  homologues  for  certain  folds,  particularly  those 
for  which  multiple  alignments  are  available.  However,  the  sensitivity  improvement  is  not 
consistent  for  all  folds.  This  is  not  advantageous  for  a  large-scale  survey  where  uniform 
sampling  and  treatment  of  the  data  is  more  important  than  sensitivity.  One  is  more 
concerned  with  accurate  relative  numbers  than  with  absolute  values.  Cobbling  together  a 
survey  through  a  disparate  collection  of  tools  and  patterns  creates  the  problem  of  devising 
consistent  scores  and  thresholds.  This  problem  is  particularly  acute  in  the  case  of 
manually  derived  sequence  patterns  and  motifs,  since  an  expert  on  a  particular  fold  or 
motif  would  expect  his  pattern  to  find  relatively  more  homologues  than  a  pattern  not  so 
expertly  constructed.  The  simple  approach  of  just  using  pairwise  comparison,  applying 
the  same  objective  procedure  to  each  fold,  circumvents  these  problems  somewhat. 
Furthermore,  it  has  an  added  advantage  in  that  it  can  be  performed  automatically  without 
manual  intervention  and,  consequently,  can  easily  be  scaled  up  to  deal  with  large  data 
sets. 

Various  weighting,  sampling  and  clustering  schemes  attempt  to  correct  for  both 
obvious  and  more  subtle  biases  [156-160].  Potentially,  even  methods  developed  to 
correct  for  biases  in  governmental  censuses  may  be  of  use  [161,  162].  However,  in  a 
large-scale  structure  survey  nothing  can  really  make  up  for  essential  folds  that  are 
missing.

When  will  we  know  all  structures  in  a  genome? 

One  way  to  overcome  the  biases  in  the  databank  is  to  wait  until  we  know  all 
structures,  or  at  least  all  the  structures  in  a  number  of  genomes.  How  long  will  this  take? 
We  tried  to  answer  this  question  in  a  rough  fashion  by  doing  the  “back  of  the  envelope” 
calculation  shown  in  figure  8.  We  looked  at  how  the  fraction  of  structurally 
“uncharacterized”  genome  sequence  is  decreasing  each  year,  as  more  structures  are 
determined. By uncharacterized sequence, we mean regions of genome sequence that are
not matched by a known fold or annotated to be a transmembrane helix or
low-complexity region. (Our exact definition is given in the legend to the figure.) For the
purposes  of  this  calculation,  one  imagines  that  the  genomes  were  sequenced  in  1975. 
Then,  based  on  the  number  of  folds  known  in  that  year  the  fraction  of  uncharacterized 
region  is  computed.  The  same  thing  is  done  for  1976,  1977,  and  so  on.  Finally,  based  on 
the  values  for  all  genomes  over  the  last  ten  years,  a  trendline  is  extrapolated  to  zero 
uncharacterized  regions.  This  gives  the  rather  pessimistic  conclusion  that  all  the 
structures  will  not  be  known  until  2050. 
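
The extrapolation step amounts to fitting a straight line and solving for its zero crossing. The sketch below shows the mechanics only; the yearly fractions are made-up values chosen for illustration and are not the data behind Figure 8, so the printed year should not be read as a reproduction of the estimate above.

```python
def year_of_zero(years, fractions):
    """Least-squares straight line through (year, fraction) points,
    returning the year at which the fitted line reaches zero."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(fractions) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, fractions))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -intercept / slope


# Made-up values for illustration only; NOT the data behind Figure 8.
years = [1988, 1990, 1992, 1994, 1996, 1998]
uncharacterized = [0.80, 0.78, 0.75, 0.73, 0.70, 0.68]
print(f"extrapolated year of zero uncharacterized sequence: "
      f"{year_of_zero(years, uncharacterized):.0f}")
```

A linear extrapolation of this kind is sensitive to the years included in the fit and assumes that the rate of structure determination stays constant, which is part of what makes it a "back of the envelope" estimate.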

Our  conclusion  is  a  bit  more  conservative  than  other  estimates  [6,  11,  163],  which 
estimate  that  all  the  structures  in  certain  small  genomes  could  be  known  in  a  decade.  This 
is  due  to  a  number  of  reasons: 

(1)  The  statistics  here  are  in  terms  of  residues  rather  than  whole  sequences.  This 
helps  correct  for  the  “multi-domain”  counting  problem  discussed  above. 

(2)  The  trendline  is  based  on  the  average  of  eight  known  genomes,  rather  than 
focusing  on  the  smallest  one,  M.  genitalium,  which  Fischer  &  Eisenberg  [163] 
analyzed. 

(3)  Only  standard  sequence  comparison  rather  than  more  sensitive  threading 
techniques  were  used  to  match  sequences  with  structures.  Fischer  &  Eisenberg, 
for  instance,  reported  a  6%  improvement  in  sensitivity  over  standard  sequence 
comparison  when  using  their  threading  technique. 

(4) The method of estimating folds here does not correct for the duplications that may
exist  in  the  uncharacterized  sequence  -  that  is,  unknown  folds  that  occur 
multiple  times.  These  may  reduce  the  number  of  future  structure  determinations 
necessary  to  match  all  the  genome  sequences.  However,  our  calculation  also 
does  not  correct  for  duplications  in  the  characterized  regions  -  that  is,  for 
known  folds  that  occur  multiple  times.  Thus,  we  hope  that  by  ignoring 
duplication  altogether  it  will  “cancel-out”  somewhat.  However,  if  unknown 
folds  were  significantly  more  duplicated  than  known  ones  this  would  tend  to 
inflate  the  time  necessary  to  determine  all  the  folds. 

Prediction  for  Characterizing  Sequences  without  a  Structural 
Homologue:  Methods 

Basic  Single-Sequence  Secondary  Structure  Prediction 

A  conservative  calculation,  thus,  shows  that  it  will  take  a  while  before  we  can 
truly  compare  microbial  genomes  in  terms  of  known  folds.  Consequently,  to  compare 
genomes,  today,  comprehensively  in  terms  of  protein  structure,  we  will  need  to  use 
structure prediction. As compared to counting known folds, structure prediction has both
advantages  and  disadvantages  when  applied  to  genome  comparison.  On  the  plus  side,  it 
does  not  suffer  from  the  problem  of  biases  that  so  plagues  the  fold  counting,  since  it  can 
potentially  be  applied  uniformly  to  all  the  ORFs  in  a  genome.  However,  the  downside  is 
that  structure  prediction  is  inaccurate,  to  varying  degrees  (whereas  fold  counting  can  be 
made  almost  perfectly  “accurate”  with  suitably  severe  sequence  comparison  thresholds). 

Although  the  basic  hypothesis  that  the  amino-acid  sequence  completely  specifies 
the  3D  structure  of  a  protein  is  believed  to  be  valid,  no  current  “ab  initio”  method  has 
proven  successful  in  predicting  3D  structure  from  the  sequence  alone  [164-166]. 
Consequently,  by  structure  prediction  we  mean  more  limited,  one-dimensional, 
predictions  for  secondary  structure,  which  assign  individual  residues  in  the  protein 
sequence to discrete states like strand, coil, or helix (soluble or transmembrane).

One  of  the  most  straightforward  secondary  structure  prediction  methods  is  the  GOR 
method  [167-169].  This  is  a  well-established  and  commonly  used  approach.  It  is 
statistically  based  so  that  the  prediction  for  a  particular  residue  to  be  in  a  given  state  (say 
Ala  in  a  helix)  is  directly  based  on  the  frequency  that  this  residue  occurs  in  this  state  in  a 
database  of  solved  structures  (taking  into  account  neighbors  at  ±1,  ±2,  and  so  forth).  For 
the  most  up-to-date  version  of  the  program,  the  prediction  for  residue  i  is  based  on  a 
window  from  i-8  to  i+8  around  i,  and  within  this  window,  the  17  individual  residue 
frequencies  (singlets)  are  combined  with  the  frequencies  of  all  136  possible  di-residue 
pairs  (doublets)  [167]. 
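
The counts of 17 singlets and 136 doublets follow directly from the window size, as the short sketch below confirms. It only enumerates the window positions and their pairs; it does not implement the GOR frequency tables themselves, and the variable names are ours.

```python
from itertools import combinations

HALF_WINDOW = 8  # the window runs from i-8 to i+8


def window_offsets(half_window: int = HALF_WINDOW):
    """Offsets (relative to residue i) of the positions in the window."""
    return list(range(-half_window, half_window + 1))


offsets = window_offsets()
singlets = offsets                         # one term per window position
doublets = list(combinations(offsets, 2))  # one term per pair of positions

print(len(singlets), len(doublets))
```

Each singlet or doublet term corresponds to a frequency looked up from the database of solved structures, so the number of parameters grows quadratically with the window size.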

Multiple-Sequence  Secondary  Structure  Prediction,  Improved  Accuracy  but 
Some  Pitfalls 

The GOR method has a well-documented single-sequence prediction accuracy of
65%. This is considerably lower than the current “state-of-the-art” methods that
incorporate multiple sequence information. In particular, Rost & Sander used a
two-layered neural network trained on a non-redundant database of 130 protein chains to
predict the secondary structure [170]. If they include protein family information in the
form of multiple-sequence alignments, they get an overall three-state accuracy of 71%.
Salamov & Solovyev’s nearest-neighbor algorithms give slightly better results
(three-state accuracy of 72.2%) [171]. The DSC method (Discrimination of Secondary structure
Class),  which  is  very  similar  in  conception  to  GOR  but  uses  multiple  sequences,  achieves 
70.1%  accuracy  [172].  Finally,  the  method  of  Livingstone  &  Barton  [173]  groups 
residues  based  on  the  similarities  and  differences  in  their  physicochemical  properties, 
achieving  a  similar  accuracy. 
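
The three-state accuracies quoted here are per-residue percentages. A minimal sketch of the calculation is given below, using a hypothetical ten-residue example and single-letter state codes (H for helix, E for strand, C for coil) that are a convention assumed for this illustration.

```python
def three_state_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state (H, E, or C) matches
    the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue example.
observed = "CHHHHCEEEC"
predicted = "CHHHCCEEEE"
print(three_state_accuracy(predicted, observed))
```

Because the measure is averaged over residues, a method can reach a respectable score while still misplacing the ends of individual helices and strands.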

The  conspicuous  agreement  in  accuracy  among  the  multiple-alignment  methods 
(~70%) may be related to a baseline level of agreement between the secondary structure
of  two  proteins  both  having  the  same  3D  fold  [174].  Note,  however,  that  using  multiple 
alignment  methods  comprehensively  on  genomes  introduces  subtle  biases  into  the 
analysis.  One  only  gets  higher  accuracy  where  one  can  construct  a  multiple  alignment. 
However,  it  is  not  possible  to  obtain  multiple  sequence  alignments  for  most  of  the 
proteins  in  a  genome.  Consequently,  bulk  predictions  of  all  the  proteins  in  a  genome 
based on multiple-alignment approaches are in a sense skewed. One gets two distinctly
different  types  of  prediction,  depending  on  how  many  homologues  a  given  protein  has. 

Transmembrane-Helix Prediction

Several prediction methods have been developed for transmembrane helices.
Some of them are based on parameters derived from the intrinsic properties of amino
acids, usually their oil-water transfer energies. A widely used example is the GES
hydrophobicity scale [175]. To use this, one calculates the hydrophobic character of each
20-residue sequence span (the typical length of a transmembrane helix) using the values
in the scale and compares them against a cutoff (usually -1 kcal/mol/residue). A value
under  this  cutoff  is  taken  to  indicate  the  existence  of  a  transmembrane  helix.  Similar 
approaches  were  taken  by  other  authors,  who  used  different  scales,  e.g.  the  Kyte-Doolittle 
or  the  Eisenberg  scales  [92,  176-179]. 
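
The sliding-window procedure can be sketched as follows. The scale values are the GES transfer energies as they are commonly tabulated (in kcal/mol, with negative values for hydrophobic residues); they were transcribed for this illustration and should be checked against [175] before any real use. The test sequence and the function name are hypothetical.

```python
# GES transfer free energies in kcal/mol (negative = hydrophobic), as commonly
# tabulated; an assumption of this sketch, to be verified against [175].
GES = {
    "F": -3.7, "M": -3.4, "I": -3.1, "L": -2.8, "V": -2.6,
    "C": -2.0, "W": -1.9, "A": -1.6, "T": -1.2, "G": -1.0,
    "S": -0.6, "P": 0.2, "Y": 0.7, "H": 3.0, "Q": 4.1,
    "N": 4.8, "E": 8.2, "K": 8.8, "D": 9.2, "R": 12.3,
}


def tm_windows(sequence: str, window: int = 20, cutoff: float = -1.0):
    """Return (start, mean_energy) for every window whose average GES value
    falls below the cutoff, i.e. every candidate transmembrane span."""
    hits = []
    for start in range(len(sequence) - window + 1):
        span = sequence[start:start + window]
        mean_energy = sum(GES[aa] for aa in span) / window
        if mean_energy < cutoff:
            hits.append((start, mean_energy))
    return hits


# Hypothetical sequence: a 20-residue hydrophobic stretch flanked by
# charged residues.
toy = "KKKKDDDD" + "LLLLAVVLLAILLGLLVVAL" + "RRRREEEE"
for start, energy in tm_windows(toy):
    print(start, round(energy, 2))
```

Several overlapping windows around a true helix usually pass the cutoff, so a practical implementation would merge adjacent hits and report the lowest-energy window as the predicted span.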

Other  transmembrane-prediction  methods  involve  accumulating  statistics  on  the 
small  set  of  known  membrane  proteins  in  the  databanks,  calculating  “propensity”  values 
for  each  position  in  the  sequence  or  using  neural  networks  [180-182].  Both  these 
approaches  utilize  multiple-alignment  information  to  improve  accuracy  [180,  182]. 
Furthermore, by analyzing compositional differences between the membrane-spanning
segments, they can predict not only the location, but also the orientation of the helices,
based on the observation that positively charged residues are more abundant on the
cytoplasmic side of the membrane [178-180, 183].

There  is  a  subtle  problem  that  exists  in  databank-derived  membrane  protein 
structure  potentials.  Since  there  are  so  few  known  membrane  protein  structures,  each  of 
them  rather  strongly  affects  the  potentials.  Moreover,  many  of  the  sequences 
characterized  as  "gold-standard"  membrane  proteins  were  in  fact  determined  to  be  such 
by  their  original  depositors  through  the  application  of  computer  programs,  not  by 
experiment.  These  then  are  often  carelessly  used  as  training  data  later.  Thus,  one  has 
prediction  serving  as  data,  again  biasing  the  potentials  to  find  more  of  what  we  already 
know.  This  problem  even  exists  in  regard  to  membrane  proteins  characterized  by 
experiment,  as  even  in  this  case  the  exact  boundary  of  the  TM  helix  is  often  determined 
through  application  of  computer  programs. 

Prediction  for  Characterizing  Sequences  without  a  Structural 
Homologue:  Results 

When  applied  in  bulk  to  the  currently  known  genomes,  secondary  structure 
prediction  has  shown  that  many  microbial  genomes  have  remarkably  similar  composition 
of helices and strands (by residue, 40% helix and 17% strand, and by element,
half-and-half) (Table 3) [24, 184]. Furthermore, the occurrence of all-alpha, all-beta and mixed
architectures  appears  also  to  be  very  similar  [185].  This  result  is  rather  surprising  when 
one  considers  that  the  genomes  have  significantly  different  amino  acid  composition  and 
different  amino  acids  have  different  physical  propensities  to  confer  secondary  structure 
[24]. There are, however, some differences in the occurrence of super-secondary structure
elements. 

There  have  also  been  many  surveys  of  the  occurrence  of  membrane  proteins  in 
genomes [24, 39, 149, 164, 182, 186-190, 191]. The overall number of membrane
proteins found depends somewhat on the prediction method and threshold used.
Nevertheless, there seems to be a broad agreement that ~20-30% of the proteins in
microbial  genomes  are  membrane  proteins,  with  yeast  having  a  slightly  larger  fraction. 
Membrane  protein  structures  can  be  classified  by  how  many  transmembrane  (TM)  helices 
they have. In all the surveys, the number of TM-helices per protein follows a similar
decreasing pattern in each genome, with fewer proteins having large numbers of
TM-helices.

To  summarize  this  data,  one  can  plot  the  fraction  of  proteins  with  a  given  number 
of  TM-helices  on  a  log-log  scale  and  get  a  straight  line  trend  as  shown  in  Figure  9  [149]. 
The  fraction  F  of  proteins  in  the  genome  with  a  given  number  n  of  TM-helices  can  be  fit 
with  the  expression  F(n)  =  .18  n’  ,  where  n  ranges  from  0  to  15.  (Without  great 
degradation  of  the  fit,  the  even  simpler  expression  l/[5n  ]  can  be  used  as  well.)  This 
expression has a form like that of Zipf’s Law, which often occurs in the analysis of word
frequency  in  documents  [192].  Similar  Zipf-law-like  expressions  have  been  found  to 
apply  in  a  variety  of  other  situations  relating  to  the  occurrence  of  proteins  (e.g.  in  relation 
to  the  occurrence  of  oligopeptide  words  [193-195]).  Moreover,  this  particular  functional 
form  for  the  occurrence  of  proteins  with  a  given  number  of  TM-helices  falls  off  smoothly 
with  increasing  numbers  of  helices.  This  implies  that  there  is  no  particular  preference  (i.e. 
local  maximum)  for  proteins  with  seven  TM-helices  and,  thus,  suggests  that  this  heavily 
studied  group  of  proteins,  which  includes  G-protein  coupled  receptors,  is  not 
exceptionally  prevalent  in  microbial  genomes. 
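
A straight line on a log-log plot corresponds to a power law, and its parameters can be recovered by ordinary least squares on the logarithms. The sketch below illustrates this with synthetic fractions generated from arbitrarily chosen parameters (a = 0.25, b = -1.5); these are not the survey results or the fitted values discussed above. The fit starts at n = 1 because the logarithm of zero is undefined.

```python
import math


def fit_power_law(ns, fractions):
    """Fit F(n) = a * n**b by least squares on log F versus log n.

    Returns (a, b); b is negative when F falls off with n.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(f) for f in fractions]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic fractions from an assumed law, purely to show that the fit
# recovers its parameters; these are not survey results.
ns = list(range(1, 16))
fractions = [0.25 * n ** -1.5 for n in ns]
a, b = fit_power_law(ns, fractions)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real counts, a pronounced local maximum at a particular n (for example, at seven helices) would show up as a point lying well above the fitted line.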

Most  of  the  membrane-protein  surveys  agree  on  this  absence  of  7-TM  proteins  in 
microbial  genomes;  some  also  claim  to  find  more  6  and  12  TM  proteins  in  bacterial 
genomes corresponding to well known families of transporter proteins [24, 187, 189,
191]. In contrast, surveys of the incomplete (and highly biased) set of human sequences
and  the  unfinished  worm  genome  find  a  relative  abundance  of  7-TM  proteins  in  these 
multi-cellular  organisms  [187,  191]. 

Discussion  and  Conclusion 

Summary 

We  have  described  how  genomes  can  be  compared  in  terms  of  protein  structure.  As 
the  number  of  genes  in  a  genome  and  the  total  number  of  folds  in  nature  are  both  quite 
limited,  these  comparisons  take  the  form  of  surveys,  which  we  dub  censuses,  of  finite 
parts  lists.  Surveys  of  the  occurrence  of  protein  folds  in  genomes  have  many  similarities 
with other types of whole-genome “occurrence” analyses, focusing, say, on motifs or
pathways. However, structure has a number of special aspects that make it particularly
advantageous for genome comparison. It has a more certain relation to sequence
similarity than does function, and it allows for precise definition of a module, or basic unit.
An  essential  element  for  a  structure  census  is  a  library  of  protein  folds  that  arranges  all 
the  known  structures  into  “fold  families.”  We  described  how  this  library  could  be  built  up 
by  using  an  automatic  comparison  program.  We  show  how  important  statistics  are  for 
defining  the  similarities  within  this  library  and  between  templates  in  the  library  and 
genome  sequences.  After  building  the  library  one  can  count  folds  in  genomes,  expressing 
the results in the form of Venn diagrams and “top-10” statistics for shared and common
folds.  Previous  analyses  have  shown  that  folds  shared  between  very  different 
microorganisms  -  i.e.  between  those  in  three  different  kingdoms  -  have  a  remarkably 
similar  structure,  being  comprised  of  repeated  beta-alpha-beta  super-secondary-structure 
units.  A  major  problem  with  this  sort  of  analysis  and  fold  counting  in  general  is  that  only 
a  small  subset  of  the  folds  in  a  complete  genome  are  currently  known  and  this  subset  is 
prone to various forms of sampling bias. One way of overcoming biases is through
structure  prediction,  which  can  be  applied  comprehensively  to  all  ORFs  in  a  genome. 
There  are  many  variants  on  the  principal  prediction  techniques  for  secondary  structure 
and  transmembrane-helices  (TM-helices).  These  have  been  applied  in  a  comparative 
sense  to  a  number  of  genomes  by  various  investigators,  giving  similar  results:  that  the 
fraction of strands and helices in a number of genomes is approximately constant and that
the fraction of proteins with a given number of TM-helices falls off with more TM
elements approximately according to a Zipf law. This latter result indicates that there is no
preference  for  the  highly  studied  7-TM  proteins  in  microbial  genomes. 

Continuously updated tables and further information pertinent to this review are
available over the web at http://bioinfo.mbb.yale.edu/genome.

General  Perspective  on  the  Scale  of  a  Genome  Survey 

As  a  concluding  point,  it  is  worthwhile  to  put  the  scale  of  the  genome  surveys  into  a 
broader  context.  As  described  above,  it  is  believed  that  there  are  roughly  1000  folds  (i.e. 
fundamental  objects)  in  nature.  These  can  be  arranged  into  a  fold  library,  and  when 
completed  this  fold  library  will  constitute  a  most  important  “data  structure”  in  molecular 
biology.  How  does  it  compare  with  the  fundamental  data  structures  in  other  scientific 
disciplines? As shown in Figure 10, in physics there are ~10 basic data objects, the
fundamental constants (the speed of light, the mass of an electron, etc.). This is a small
enough number that one can keep them all in memory. Physicists understand the world
through constructing intricate mathematical relationships between these constants and
actual physical observables. In chemistry, there are about an order of magnitude more
fundamental data objects, the 113 chemical elements. This is too many things to keep in
one's head at once, so usually these elements are written down on a page in the form of
the  periodic  table.  Chemists  understand  the  world  by  seeing  trends  and  periods  in  this 
table.  In  (molecular)  biology  we  expect  to  have  at  least  an  order  of  magnitude  more  data 
objects  in  the  fold  library  than  the  elements  in  the  periodic  table.  Moreover,  each  fold 
represents  a  substantially  more  complex  entity  than  a  physical  constant  or  chemical 
element.  Consequently,  it  is  not  possible  to  keep  the  fold  library  in  one's  head  or  write  it 
down on a piece of paper. It has to be stored in a computer database. What type of
understanding, of the sort we can carry with us in our minds, can we expect from this
database? It is not going to be mathematical relationships as in physics; rather, it is going to
be statistics, in the sense of the top-10 fold list and the P-value for similarities, discussed
earlier. Thus, our goal in these large-scale surveys is really statistical understanding.

It is also interesting to note that the data-set scale in molecular biology is
approximately the same as that faced by economists and financiers studying the stock
market. The stock market contains roughly 1000-10,000 well-defined objects, i.e. public
companies. While it is, of course, possible to study each of these individually, it is not
possible to do this for all companies simultaneously, so economists understand the stock
market through various key statistics that summarize large groups of companies (e.g.
market indices for various sectors). The next larger size of data set occurs in other branches
of the social sciences, such as demographics and political science, when one is concerned
with surveys of whole populations, as in political polls. Here the number of fundamental
data objects can easily exceed 1 million. However, an exact enumeration and description
of each data object is no longer possible, even in the context of a computer database, so
one is no longer surveying a finite list, but rather sampling a large population to estimate
various statistics.

Table 1, Current Microbial Genomes Available

Abbrev  Genome                                 Reference
HI      Haemophilus influenzae                 [196]
MG      Mycoplasma genitalium                  [197]
MJ      Methanococcus jannaschii               [198]
SS      Synechocystis sp.                      [199]
MP      Mycoplasma pneumoniae                  [200]
SC      Saccharomyces cerevisiae               [5]
HP      Helicobacter pylori                    [190]
EC      Escherichia coli                       [201]
MT      Methanobacterium thermoautotrophicum   [202]
BS      Bacillus subtilis                      [203]
AB      Archaeoglobus fulgidus                 [204]
BB      Borrelia burgdorferi                   [39]
AA      Aquifex aeolicus                       [205]

The table lists the currently published microbial genomes that are discussed in the text. This
table will rapidly be out of date, so it is probably best to consult a website, such as the TIGR
microbial database (http://www.tigr.org/tdb/mdb/mdb.html) or our own genomes and
structures site (http://bioinfo.mbb.yale.edu/genome).




Table  2,  Common  Folds  Ranked  by  Duplication  and  Expression 


Rep.       Genome       Expression  Expression
Structure  Duplication  (aerobic)   (anaerobic)  scop fold name
1hcl       1            3           4            Protein kinases (PK), cat. core
1gky       2            ?           2            NTP Hydrolases with P-loop
1ard       3            9           5            Classic zinc finger
2rn2       4            2           ?            Ribonuclease H-like motif
1xel       5            4           3            Rossmann Fold
125d       6            6           7            Zn2/Cys6 DNA-binding dom.
2bbk-H     7            8           16           7-bladed beta-propeller
1byb       8            5           6            TIM-barrel
1fxd       9            7           10           like Ferredoxin
1enh       10           ?           ?            DNA-binding 3-helical bundle
1lep-A     ?            ?           ?            GroES-like
1dkz-A     ?            ?           ?            like HSP70, Ct-dom.


This table shows the most common folds in the yeast genome ranked according to a
variety of criteria. Column 5 (“name”) gives the scop name for the fold, as determined by
scop [7]. (In the table "dom." is used as an abbreviation for domain, "Nt-dom." for
N-terminal domain, and "Ct-dom." for C-terminal domain.) Column 1 (“Rep. Struc.”) gives
a representative structure with this fold, including residue selection. Column 2
(“Duplication”) gives an ordering of folds in terms of the number of times they are found
in the yeast genome. For instance, the top fold (kinase) is found 110 times, while the
second fold (NTP hydrolase) is found 69 times. Columns 3 and 4 (“expression”) give an
ordering of folds in terms of their degree of expression. Using the data from DeRisi et al.
[152], the total expression E of a fold F is calculated as the sum of the expression levels of
all the ORFs that contain this fold. The expression level of a given ORF (i.e. ORF i) is
the degree of its “Red” color on a cDNA microarray, $R(i)$, less the background,
$R_{\mathrm{back}}(i)$, viz:

$E(F) = \sum_{i \text{ containing } F} \left[ R(i) - R_{\mathrm{back}}(i) \right]$

Column 3 gives the expression in aerobic conditions
(high sugar, second time-series data point in DeRisi et al.), and Column 4, in anaerobic
conditions (low sugar, high ethanol, last time-series data point in DeRisi et al.). Note how
some folds that are in the top-10 in terms of duplication are not in this select list in terms
of expression (e.g. “DNA-binding 3-helical bundle”). The table is adapted from [149]. It
is available in its entirety (i.e. not just top-10) over the web at
http://bioinfo.mbb.yale.edu/genome/browser/fold-report.
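
The fold-expression sum described in this legend can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the ORF identifiers, the fold assignments, and all intensity values below are hypothetical placeholders, not data from DeRisi et al. or from Table 2.

```python
from collections import defaultdict


def fold_expression(orf_folds, red, background):
    """Sum background-corrected "Red" signal over all ORFs containing each fold."""
    totals = defaultdict(float)
    for orf, folds in orf_folds.items():
        net = red[orf] - background[orf]      # R(i) - Rback(i) for ORF i
        for fold in set(folds):               # an ORF counts once per fold
            totals[fold] += net
    return dict(totals)


def rank_folds(totals):
    """Rank folds from highest (rank 1) to lowest total expression."""
    ordered = sorted(totals, key=lambda fold: (-totals[fold], fold))
    return {fold: rank for rank, fold in enumerate(ordered, start=1)}


# Hypothetical inputs, for illustration only.
orf_folds = {
    "ORF1": ["kinase"],
    "ORF2": ["kinase", "P-loop"],
    "ORF3": ["P-loop"],
    "ORF4": ["TIM-barrel"],
}
red = {"ORF1": 120.0, "ORF2": 300.0, "ORF3": 900.0, "ORF4": 50.0}
background = {"ORF1": 20.0, "ORF2": 50.0, "ORF3": 100.0, "ORF4": 40.0}

totals = fold_expression(orf_folds, red, background)
ranks = rank_folds(totals)
```

With these made-up numbers a fold that occurs in few but strongly expressed ORFs can outrank a more frequently duplicated one, which is the kind of difference between the duplication and expression orderings that the legend points out.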

Table 3, Overall Predicted Secondary Structure Composition

        Total Number         Frac. a.a. in...
        ORFs     a.a.        strand   helix
Avg     2206     775998      17%      ?
SD      1731     ?           1%       ?
EC      4290     1358465     17%      39%
HI      1680     505279      16%      41%
HP      1577     500616      15%      42%
MG      468      170400      17%      39%
MJ      1735     497968      19%      37%
MP      677      237905      17%      39%
SC      6218     2900670     17%      34%
SS      3168     1033450     16%      38%

Secondary structure composition of eight genomes, as predicted by the GOR program
[167], applied to every amino acid (a.a.) in each genome. This gives a somewhat lower
fraction of helix than if one predicts the structure of only the uncharacterized regions, as
defined in Figure 8. Genome names are defined in Table 1. The table is adapted from [184].

Figure 1, At What Structural Resolution do Organisms Differ?

Schematic  illustrates  a  question  involved  in  comparing  genomes  in  terms  of  protein  folds. 
Different  organisms  (e.g.  a  yeast  and  E.  coli  or  a  person  and  a  plant)  clearly  appear 
morphologically distinct at macroscopic resolution (1 m to 10" ’ m). However, they look
the same at truly atomic resolution (~1 Å), where they are composed of similar
proportions  of  the  organic  elements.  At  what  resolution  does  one  start  to  see  differences? 

Figure  2,  Relationship  between  Sequence  Similarity  and 
Structural  Similarity 

This figure gives an updated version of Chothia & Lesk’s classic plot relating divergence
in sequence to divergence in structure [64]. In the original plot Chothia & Lesk aligned
32 pairs of homologous structures (e.g. globins from two different species). For each pair
they calculated a sequence similarity, in terms of the percentage of identical residues at
aligned positions (“PID”), and a structural similarity, expressed as the RMS deviation in
alpha-carbon positions of aligned atoms (“the RMS”). They found that the two quantities
appeared to be highly related, following the relationship RMS = 0.4 exp(1.87 H), where
H = 1 - PID/100 is the fraction of non-identical aligned residues. To
update this plot, we used a much larger data set, the scop classification of protein
structure, version 1.35 [7]. This data set contains more than 14000 pairs of similar
structures. (We used exactly 13967 pairs.) Instead of describing the sequence similarity of
each pair in terms of percentage identity, we used the more modern statistical language of
the P-value [94, 136]. Then, depending on whether or not a given pair had any appreciable
sequence similarity, we aligned it using either standard Needleman-Wunsch sequence
comparison or a structural alignment program [90, 96], and did a least-squares fit based
on the aligned atoms. This allowed us to characterize the structural similarity of the pair
with two numbers, an RMS and the number of aligned atoms (N). For all the pairs within
a range of sequence alignment scores (i.e. a bin), we calculated various RMS statistics:
mean, median, and top and bottom quartile. Finally, we graphed these quantities versus
sequence similarity (P-value). This plot shows a similar relationship between sequence
and structure as in the original work of Chothia & Lesk.
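
The binning step described in this legend can be sketched as follows. The sketch assumes that each structure pair has already been reduced to a (P-value, RMS) tuple and that bins are taken on -log10(P); the bin width, the function name, and the sample values are hypothetical and are not taken from the scop data set.

```python
import math
import statistics
from collections import defaultdict


def rms_stats_by_bin(pairs, bin_width=10.0):
    """Summarize RMS values for pairs grouped into bins of -log10(P-value).

    pairs: iterable of (p_value, rms) tuples with p_value > 0.
    Bins holding fewer than two pairs are skipped, since quartiles
    need at least two data points.
    """
    bins = defaultdict(list)
    for p_value, rms in pairs:
        score = -math.log10(p_value)
        bins[int(score // bin_width)].append(rms)

    summary = {}
    for index, values in sorted(bins.items()):
        if len(values) < 2:
            continue
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[index] = {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": median,
            "q1": q1,   # bottom quartile
            "q3": q3,   # top quartile
        }
    return summary


# Made-up (P-value, RMS in Angstroms) pairs, for illustration only.
pairs = (
    [(1e-5, rms) for rms in (1.0, 2.0, 3.0, 4.0, 5.0)]
    + [(1e-15, rms) for rms in (0.5, 1.0, 1.5)]
    + [(1e-25, 0.4)]
)
summary = rms_stats_by_bin(pairs)
```

Plotting the mean, median, and quartiles of each bin against the bin's P-value range gives a curve of the kind described above.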

Figure  3,  A  Fold  Template 

TOP-LEFT shows a structural alignment of two similar protein structures (globins).
TOP-RIGHT shows how a number of aligned structures can be fused into a “fold template,”
where  the  variability  at  each  aligned  position  is  represented  with  an  “uncertainty 
ellipsoid.”  A  large  number  of  these  fold  templates  could  constitute  a  fold  library. 
BOTTOM  shows  the  fold  template  in  terms  of  sequence.  Note  how  the  conserved,  “core” 
regions  are  disjoint  in  terms  of  sequence. 

Figure  4,  The  Range  of  Structural  Similarities 

LEFT, An easy to align pair, two globins. The aligned positions are indicated by small,
gray CPK spheres. Most of the residues are aligned correctly. CENTER, A harder to
align pair, an immunoglobulin light-chain variable domain (d7fabl2) and an
immunoglobulin constant domain (d1reia1). RIGHT, A very hard to align pair involving the
C-terminal domain of glyceraldehyde-3-phosphate dehydrogenase. The
left half of the subfigure shows wire frames, which illustrate how hard the relationship is
to see. These structures (d1gd1o2 in the TOP-HALF and d1dpga2 in the BOTTOM-HALF)
are considered to share the same fold. This is highlighted in the ribbon diagram
(RIGHT-RIGHT) and indicated in the topology diagram. This figure is adapted from [90].

Figure  5,  Distribution  of  Known  Folds  amongst  the  Genomes 

This figure is adapted from [24, 147]. At the time this analysis was done there were 300
known folds, somewhat fewer than at present (~350). TOP, of the total 300 folds, 148
appear in the 3 genomes, with 45 shared between all three. The abbreviations for the
three genomes are shown in Table 1. Most of the folds are either in the HI or SC
genomes, even though the HI genome is smaller than the MJ one. This reflects the bias in the
structure  databank.  BOTTOM  shows  how  the  300  folds  are  distributed  amongst  ALL 
bacterial  and  eukaryotic  sequences,  showing  how  representative  a  genome  is  for  a  whole 
kingdom. 
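
The counts behind a Venn diagram of this kind reduce to simple set operations. The following is a minimal sketch, assuming each genome has already been reduced to the set of folds matched in it; the function name and the fold assignments are hypothetical and do not reproduce the 148 folds counted in the figure.

```python
def venn_regions(folds_by_genome):
    """Count the folds falling in each region of a Venn diagram.

    Each key of the result is the tuple of genomes (sorted by name)
    in which a fold occurs; each value is the number of such folds.
    """
    genomes = sorted(folds_by_genome)
    regions = {}
    for fold in set().union(*folds_by_genome.values()):
        members = tuple(g for g in genomes if fold in folds_by_genome[g])
        regions[members] = regions.get(members, 0) + 1
    return regions


# Hypothetical fold assignments, for illustration only.
folds_by_genome = {
    "HI": {"TIM-barrel", "Rossmann", "P-loop", "Flavodoxin-like"},
    "MJ": {"TIM-barrel", "Rossmann", "Ferredoxin-like"},
    "SC": {"TIM-barrel", "P-loop", "Ferredoxin-like", "Protein kinase", "Zinc finger"},
}

regions = venn_regions(folds_by_genome)
shared_by_all = regions[("HI", "MJ", "SC")]
```

Because every fold lands in exactly one region, the region counts add up to the total number of distinct folds found in the genomes.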

Figure  6,  Five  Folds  common  in  All  Three  Kingdoms 

The  figure  shows  five  basic  molecular  parts,  five  folds  that  are  shared  by  SC,  HI,  and  MJ 
and  are  common  in  each  of  the  three  genomes.  Here  “commonness”  is  determined  by  the 
average  frequency  rank  of  the  fold  over  each  of  the  three  genomes.  All  folds  are  drawn 
with  molscript  [206].  Also  shown  are  highly  schematic  views  of  the  sheet  topology. 

Boxes  indicate  parallel  strands  in  a  beta-sheet  with  their  order  noted.  (Strands  are  coming 
out  of  the  page.)  Solid  arcs  joining  the  boxes  indicate  right-handed  connections  between 
the  parallel  strands.  All  of  these  involve  skipping  no  more  than  2  strands  and  are  through 
a  parallel  helix  packed  onto  the  sheet,  from  above  or  below.  Half  of  an  arc  indicates  that 
there  is  a  parallel  helix  connected  to  either  the  first  or  last  strand  of  the  sheet.  There  is 
one  exceptional  connection,  indicated  with  a  dotted  line:  In  the  Rossmann  fold  there  is  a 
connection  across  3  strands  through  a  parallel  helix.  This  figure  is  adapted  from  [24]. 
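
The "commonness" measure used in this legend, the average frequency rank of a fold over the genomes, can be sketched as follows. The occurrence counts and function names are hypothetical, and ranking within each genome over all of its folds (not only the shared ones) is an assumption of this sketch.

```python
def frequency_ranks(counts):
    """Rank folds within one genome: the most frequent fold gets rank 1."""
    ordered = sorted(counts, key=lambda fold: (-counts[fold], fold))
    return {fold: rank for rank, fold in enumerate(ordered, start=1)}


def average_rank(counts_by_genome):
    """Average the per-genome frequency rank of every fold shared by all genomes."""
    shared = set.intersection(*(set(c) for c in counts_by_genome.values()))
    ranks = {genome: frequency_ranks(c) for genome, c in counts_by_genome.items()}
    return {
        fold: sum(ranks[genome][fold] for genome in ranks) / len(ranks)
        for fold in shared
    }


# Hypothetical fold occurrence counts, for illustration only.
counts_by_genome = {
    "SC": {"Protein kinase": 100, "P-loop": 60, "Rossmann": 40, "TIM-barrel": 30},
    "HI": {"P-loop": 50, "Rossmann": 30, "TIM-barrel": 20},
    "MJ": {"P-loop": 40, "Rossmann": 25, "TIM-barrel": 15},
}

avg = average_rank(counts_by_genome)
most_common = min(avg, key=avg.get)
```

Folds absent from any one genome drop out, so a fold that is very frequent in a single genome (the kinase in this made-up input) does not appear among the shared, common folds.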

Figure  7,  Issues  Associated  with  the  Multi-domain  Nature  of 
Proteins 

This schematic highlights the fact that a given ORF can contain many structural features.
TOP, Various regions of a representative ORF are annotated with different structural
features, such as transmembrane helices or homology to known structure. Sometimes
these features overlap, as is often the case for TM-helices and low-complexity regions.
After "masking" the first four structural features (PDB matches, low-complexity regions,
TM-helices, and linkers), one is left with uncharacterized regions, which can be
characterized by a limited amount of structure prediction. BOTTOM shows that having
multiple domains introduces complexity in clustering sequences. Naively applied
single-linkage clustering will group together two sequences (i.e. 2 and 4) that have similarity to
different domains (B and C) in a third, intermediate sequence (3). TOP is adapted from
[24].
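
The clustering pitfall in the BOTTOM panel can be reproduced with a small union-find sketch. The domain composition assigned to each sequence below is a hypothetical stand-in for the schematic, and "similarity" is simplified to "shares at least one domain."

```python
from itertools import combinations


def single_linkage(items, links):
    """Group items into clusters by following pairwise links (union-find)."""
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), set()).add(item)
    return sorted(clusters.values(), key=min)


# Hypothetical domain composition: sequence 3 carries both domain B and domain C.
domains = {1: {"A"}, 2: {"B"}, 3: {"B", "C"}, 4: {"C"}}

# Two sequences are linked when they share at least one domain.
links = [(a, b) for a, b in combinations(sorted(domains), 2) if domains[a] & domains[b]]

clusters = single_linkage(sorted(domains), links)
shared_by_2_and_4 = domains[2] & domains[4]
```

Sequences 2 and 4 end up in the same cluster even though they have no domain in common; the only connection runs through the two-domain sequence 3.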

Figure  8,  When  will  all  Structures  in  Genome  be  Known? 

This figure attempts to determine when all the structures will be known for the proteins in
a complete genome. The TOP panel shows how the fraction of amino acids characterized
in eight genomes increases each year with the addition of new structures to the PDB,
imagining that the complete sequences of the eight genomes were known a quarter
century ago. A loose "back-of-the-envelope" trendline is fit to the increase in the last
decade. In the BOTTOM panel, this trendline is extrapolated to the point when all
structures in the genomes are known, which is rather pessimistically estimated to be
around 2050. Characterized regions are structural features, as shown in Figure 7. They are
either PDB matches (as determined by the FASTA program), TM-helices (identified as
described in Figure 9), low-complexity regions (identified using the SEG program [207,
208]), or linkers (short stretches of less than 50 residues linking two of the previous
elements). Abbreviations for the genomes are in Table 1.
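
The bookkeeping behind the "fraction characterized" can be sketched as interval arithmetic on a single ORF. The sketch assumes that features are given as half-open residue intervals and that a linker is any gap shorter than 50 residues lying between two characterized elements; the function names, the ORF length, and the feature coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def characterize(orf_length, features, max_linker=50):
    """Split an ORF into merged features, linkers, and uncharacterized regions."""
    merged = merge_intervals(features)
    linkers, uncharacterized = [], []
    cursor = 0
    for index, (start, end) in enumerate(merged):
        if start > cursor:
            # Only a gap with a feature on both sides can be a linker.
            if index > 0 and start - cursor < max_linker:
                linkers.append((cursor, start))
            else:
                uncharacterized.append((cursor, start))
        cursor = end
    if cursor < orf_length:
        uncharacterized.append((cursor, orf_length))
    return merged, linkers, uncharacterized


# Hypothetical 400-residue ORF: two PDB matches, a TM-helix, and an
# overlapping low-complexity region.
features = [(20, 150), (180, 200), (190, 210), (300, 380)]
merged, linkers, uncharacterized = characterize(400, features)

covered = sum(end - start for start, end in merged + linkers)
fraction_characterized = covered / 400
```

Summing the covered residues over all ORFs in a genome, year by year as structures are added, would give the kind of curve that the trendline is fit to.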

Figure  9,  Transmembrane  Folds  in  Microbial  Genomes 

This  log-log  graph  shows  the  occurrence  of  membrane  proteins  with  a  given  number  of 
transmembrane  (TM)  helices  in  each  of  the  eight  genomes.  Abbreviations  are  defined  in 
Table  1.  The  occurrence  drops  off  in  a  similar  fashion  in  all  eight  genomes,  according  to 
a  Zipf-like  law,  and  a  fit  to  all  eight  is  shown  in  the  graph.  The  transmembrane  segments 
were  identified  using  the  GES  hydrophobicity  scale  [175].  Figure  is  adapted  from  [149]. 
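
A Zipf-like fall-off appears as a straight line on a log-log plot, so it can be checked with an ordinary least-squares fit in log space. The sketch below assumes a power-law form, count ≈ a · n^(-b), where n is the number of TM-helices; the counts are synthetic and exactly follow such a law, and they are not the values plotted in the figure.

```python
import math


def fit_power_law(ns, counts):
    """Fit count = a * n**(-b) by least squares on log(count) versus log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope   # (a, b)


# Synthetic counts of proteins with n TM-helices, for illustration only.
ns = list(range(1, 9))
counts = [400.0 / n ** 2 for n in ns]

a, b = fit_power_law(ns, counts)
predicted_7tm = a * 7 ** (-b)
```

Under such a fit, an excess of proteins with a particular number of helices, for example 7, would show up as a point lying well above the fitted line.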

Figure  10,  Scale  of  the  Data:  Molecular  Biology  vs.  Other 
Disciplines 

Schematic illustrates the scale of the fundamental data set in molecular biology, the table
of folds, in comparison to data sets in other disciplines. The table of folds is expected to
contain between 1000 and 10000 objects. This is larger in scale than the fundamental data
in physics and chemistry (~10 fundamental constants in physics and ~100 elements in
chemistry), about the same size as a fundamental data set in finance (the ~1000-10000
companies traded on the stock market), and smaller than data sets commonly used in
politics and demographics (>1,000,000 individuals in a state).